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1 Introduction 



This article develops sequential regression trees, with implementation details and examples for 
applications in experiment design, optimization, classification, and on-line learning. Our most 
basic insight is the characterization of partition trees as a dynamic model state. This allows the 
model to grow with data, such that each new observation leads to only small changes in the 
posterior, and has the practical effect of reducing tree space dimension without affecting model 
flexibility. A Bayesian inferential framework is built around particle learning algorithms for the 
efficient on-line filtering of tree-states, and examples demonstrate that the proposed approach 
is able to provide better results compared to commonly used methods at a fraction of the cost. 

We consider two generic regression formulations. The first has real-valued response (uni- 
variate here, but this is not a general restriction) with unspecified mean function and Gaussian 
noise, and the second has a categorical response with covariate dependent class probabilities: 

a. y = /(x)+£, £~N(0,(7^) b. p(y = c) = Pe(x), c=l,...,a (1) 

The regression tree framework provides a simple yet powerful solution to modeling these re- 
lationships in situations where very little is known a priori. For such models, the covariate 
space is partitioned into a set of hyper-rectangles, and a simple model (e.g., linear mean and 
constant error variance) is fit within each rectangle. This adheres to a general strategy of having 
a state-space model (i.e., the partition structure) induce flexible prediction despite a restrictive 
model formulation conditional on this state. For trees, our strategy allows posterior inference 
to be marginalized over the parameters of simple models in each partition. Given a particular 
tree, predictive uncertainty is available in closed form and does not depend upon approximate 
posterior sampling of model parameters. Uncertainty updating is then a pure filtering prob- 
lem, requiring posterior sampling for the partition tree alone. Thereby this article establishes a 
natural sequential framework for the specification and learning of regression trees. 
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Although partitioning is a rough modeling tool, trees will often be better suited to real- 
world applications than other "more sophisticated" alternatives. Compared to standard basis 
function models for nonparametric regression, the dynamic trees proposed herein provide some 
appealing properties (e.g., flexible response surface, nonstationarity, heteroskedasiticity) at the 
expense of others (e.g., smoothness, explicit correlation structure) within a framework that is 
trivial to specify and allows for conditional inference, given the global partition tree-state, to 
be marginalized over all model parameters. Prediction is very fast, requiring only a search 
for the rectangle containing a new x, and individual realizations yield an easily interpretable 
decision tree for regression and classification problems. Finally, and perhaps most importantly, 
the models are straightforward to fit through sequential particle algorithms, and thus offer a 
major advantage in on-line inference problems. 

Consider the setting of sequential experiment design, which serves as one of the motivators 
for this article. In engineering applications, and especially for computer experiments, the typ- 
ical model of ([TJa) is built from a stationary Gaussian process (GP) prior for / with constant 



additive error variance (see, e.g., Santner et al. , 2003). This GP specification is very flexible. 



but inference for sample size n is 0{n^) due to the need to decompose n x n matrices. Fur- 
thermore, full Bayesian inference is usually built around MCMC sampling, which demands 
0{Bn^) computation for a posterior sample of size B. Sequential design requires this batch 
sampling to be restarted whenever new [x, y] pairs are added into the data, making an already 
computationally intensive procedure even more expensive by failing to take advantage of the 
sequential nature of the problem. With lower order inference (i.e., only a search of partitions 
followed by simple leaf regression model calculations) and an explicit dynamic structure, our 
dynamic regression trees are less expensive and can be fit in serial. 

Furthermore, it is essential that the fitted model is appropriate for the problem of sequential 
design. In this, the standard choice of a stationary GP will again often fall short. Whether 
designing for optimization or for general learning, the goal is to search for distinctive areas of 



the feature space that warrant further exploration (e.g., due to high variance or low expected 
response). The single most appropriate global specification may be very different from the ideal 
local model in areas of interest (e.g., along breaks in the response surface or near optima), such 
that it is crippling to assume stationarity or constant error variance. However, nonstationary GP 



modeling schemes (e.g. |Higdon et al.[|1999[ ) require even more computational effort than their 
stationary counterparts. In contrast, nonstationary function means and nonconstant variance 
are fundamental characteristics of tree-based regression. 

These considerations are meant to illustrate a standard point: It may be that some flexible 
functions {/, a} or p axe ideal for modeling ([T]a-b) in any particular setting, but that a simple 
and fast approximation - with predictions averaged over posterior uncertainty - is better suited 
to the design scenario or classification problem at hand. We thus put forth dynamic regression 
trees as a robust class of models with some key properties: prior specification is scale-free and 
automatic; inference is marginalized over model parameters given a global partition state; it is 
possible to sequentially filter uncertainty about the partition state; and the resulting prediction 
surfaces are appropriate for modeling complicated regression functions. 

The remainder of the paper is organized as follows. Section [TT| provides a survey of parti- 



tion tree models and 1.2 contains a review of Bayesian inference for trees. Our core method- 
ology is outlined in Section |2} which develops a sequential characterization of uncertainty up- 



dating for regression trees. The partition evolution is defined in 2.1 , leaf model and prediction 
details are in 2.2 , a particle learning algorithm for posterior simulation is provided in |2.3[ and 
marginal likelihood estimation is discussed in |2.4[ Section |3] then describes a set of application 



examples. First, 3.1 illustrates linear and constant mean regression models on some simple 
datasets, and compares results to alternatives from the literature. We consider the sequential 
design of experiments, with optimization search in 3.2 and active learning in 3.3 Finally, 3.4| 
provides two classification examples. Section]?] contains short closing discussion. 
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1.1 Partition Trees 

The use of partition trees to represent input-output relationships is a classic nonparametric 
modeling technique. A decision tree is imposed with switching on input variables and terminal 
node predictions for the relevant output. Although other schemes are available (e.g., based 
on Voronoi tessellations), the standard approach relies on a binary recursive partitioning of 



input variables, as in the classification and regression tree (CART) algorithms of Breiman et al. 



(1984). This forces axis-aligned partitions, although pre-processing data transformations can 
be used to alter the partition space. In general, the computational and conceptual simplicity of 
rectangular partitions will favor them over alternative schemes. As binary recursive partition 
trees are fundamental to our regression model, we shall outline some notation and details here. 

Consider covariates x* = {x^}*^^. A corresponding tree T consists of a hierarchy of nodes 
1] E T associated with different subsets of x*. The subsets are determined through a series of 
splitting rules, and these rules also dictate the terminal node associated with any new x. Every 
tree has a root node, Rr, which includes all of x*, and every node r] is itself a root for a sub-tree 
containing nodes (and associated subsets of x*) below r] in the hierarchical structure defined by 
T. A node is positioned in this structure by its depth, Dr,, defined as the number of sub-trees 
of T other than T which contain r] (e.g., Dr^ = 0). 

Figure [T] shows two diagrams of local tree structure (partition trees tend to grow upside- 
down). Graph (i) illustrates the potential neighborhood of each node i] E T, and graph (ii) 
shows a hypothetical local tree formed through recursive binary partitioning. Left and right 
children, Ci{rj) and Cr{r]), are disjoint subsets such that Ci{ii]) U Cr{rj) = rj; and the parent 
node, P{r]), contains both 77 and its sibling node S{r]), where r]r\S{rj) = and ■r]US{ri) = P{rj). 
Node ancestors are parents which contain a given node, and a node's depth is equivalent to its 
number of ancestors. Any of the neighbors in {i) may be nonexistent for a specific node; for 
example, root Rj- has no parent or sibling. If a node has children it is considered internal, 
otherwise it is referred to as a leaf node. The set of internal nodes for T is denoted Iq-, and the 
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Figure 1: Tree heuristics. Graph (a) shows the parent-sibling-child network, and (6) illustrates parti- 
tioning on two hypothetical covariates (internal nodes show spUt points, and leafs show the node set). 

set of leaves is L7-. 

The tree is completed with a decision rule (i.e., a simple regression or classification model) 
at each leaf. Suppose that every covariate vector x^ is accompanied by response yg, such that 
the complete data is [x, y]* = {x>,, y^}*^]^. Then, with regression models parametrized by 
for each leaf 77 e L7-, independence across tree partitions leads to likelihood p(|/* | x*, T, 6) — 
rir? e Lr P I ^'j)' where [x, — {xj, yj : Xj e 77} is the data subset allocated to 77. 

1 . 2 Inference for Tree Models 

A novel approach to regression trees was pioneered by Chipman, George, and McCuUoch 
(CGM: 1998, 2002), who designed a prior distribution, 7r(T), over possible partition struc- 
tures. This allows for coherent inference via the posterior, p(T | [x, yf) oc p(|/* | T, x*)7r(T), 
including the assessment of partition uncertainty and the calculation of predictive bands. 

The CGM tree prior is generative, in that it specifies tree probability by placing a prior 
on each individual partition rule. Any given leaf node 77 may be split with depth-dependent 
probability Pspiit(T, rj) — a{l + D^)"^, where a, /? > 0. The coordinate (i.e. dimension of x) 
and location of the split, (i, x)n, have independent prior pruie(T, 7/), which is typically a discrete 
uniform distribution over all potential split points in x^ = x* n 77. Implicit in the prior is the 
restriction that a partition may not be created if it would contain too few data points to fit the 
leaf model (i.e., > 3 for the constant model and > 0? + 2 for a linear model with d covariates). 
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Ignoring invalid partitions, the joint prior is thus 



T^iDoc W Psp\it{T,'n) JJ [1 -Pspiit(T,?7)]. (2) 

That is, the tree prior is the probability that internal nodes have split and leaves have not. 

In their seminal paper, CGM develop a Metropolis-Hastings MCMC approach for sampling 
from the posterior distribution of partition trees. This algorithm is able to explore posterior 
tree space by stochastically proposing incremental modifications to T {grow, prune, change, 
and swap moves) which are accepted according to the appropriate Metropolis-Hastings ratio. 
Although this approach provides a search of posterior trees, the resulting Markov chain will 
generally have poor mixing properties. In particular, the proposed tree moves are very local. 
An improbable sequence of (first) prunes and (then) grows are required to move between parts 
of tree space that have similar posterior probability, but yet represent drastically different hier- 
archical spht rules. Furthermore, MCMC is a batch algorithm and must be re-run for each new 
observation, making it inefficient for on-line applications. 

The next section reformulates regression trees as a dynamic model. Our novel characteri- 
zation leads to an entirely new class of models, and we develop a sequential particle inference 
framework that is able to alleviate many of the difficulties with MCMC for trees. 



2 Dynamic Regression Trees 

We now redefine partition trees as a dynamic model for predictive uncertainty. We introduce 
model state % which, following from Section [TTT| includes the recursive partitioning rules asso- 



ciated with X*, the set of covariates observed up-to time t. Section 2.1 defines the mechanisms 
of state transition Tt-i — t- 7f as a function of x^, the newly observed covariates, such that % 
is only allowed to evolve from Tt-i through a small set of operations on partition structure in 
the neighborhood of x^ (see Figure [T|). That is, we specify a prior distribution for the evolu- 
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tion, p(7f I 7t_i,x*). Section 2.2 then shows how to obtain tree likelihood p(?/* | Tt,x^) as a 



product of leaf node marginal likelihoods, and hence allows us to assign posterior weight over 



the discrete set of potential trees generated through the evolution prior. Section 2.2 also de- 
tails the conditional predictive distribution and describes model particulars for three regression 
leaves: constant, linear, and multinomial. A particle learning algorithm for on-line posterior 



simulation is outlined in Section 2.3 In this, each particle consists of a tree (partitioning rules) 
and sufficient statistics for leaf node predictive models, and our filtering update at time t com- 
bines small tree changes around with a particle resampling step that accounts for global 



uncertainty. Finally, we describe marginal likelihood estimation in Section 2.4 



2. 1 Partition Evolution 

This section will specify tree dynamics through the evolution equation p(7t | 7f_i, x*). Keep- 
ing possible changes localized in the region relevant to a new observation x^, the tree evolution 
7t-i — > 7t is defined through one of three equally probable moves on leaf node r]{xt): 

stay: The tree hierarchy remains unchanged, and 7t = Tt-i. 

prune: The tree may be pruned by removing //(x^) and all of the nodes below and 
including its sibling S{r]{^t))- Hence, parent P{r]{-xt)) becomes a leaf node in Tt. If Tt-i 
is null (contains only a root), then the prune move is invahd since ri{-Kt) is parentless. 

grow: The grow move creates a new partition within the hyper-rectangle implied by split 
rules of the ancestors of r]{xt). It consists of uniformly choosing split dimension j and 
split point xf""" (within an interval determined by {j,r]{xt),xt} as described below), 
and then dividing 77 (xj according to this rule. The former leaf containing Xj becomes an 
internal node in 7*, acting as parent to two new leaf nodes (one of which contains x^). 

In this way, tree changes are restricted to the neighborhood of ^^(xf) illustrated in Figure[TJ we 
will see in Section 23]that this feature is key in limiting the cost of posterior simulation. 
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Given these three possible moves, we can now define an evolution prior as the product of 
two parts: a probability on each type of tree move and a distribution over the resultant tree 
structure. In the former case, we assume that possible moves among stay, prune, and grow 
are each a priori equally likely (e.g., Pmovc = 1/3 for each when all are available), although 
one may use other schemes if desired. For the latter distribution, we build on the inferential 
framework of Section 1 .2 and assume a CGM prior for tree structure such that 7r(7t) is as in (j2]) 
based onpspiit(7I,77) = a(l + D^)"^. We then have p(7^ | 7I-i,x*) oc Emex(x*)Pm^('^")' 
where T"* is the tree that results from applying move m to 7t-i and (x*) is the set of possible 
moves. Hence, one may view p(7t | 7t-i, x*) as a covariate dependent prior for the next tree, 
where the penalty on general tree shape, n(Tt), is constant for all t but the set of new potential 
trees you are willing to entertain is dictated by x^. 

It is important to note that, in contrast with staying or pruning, the grow move actually 
encompasses a set of tree evolutions: we can split on any input dimension and, for each di- 
mension j, the set of possible grow locations is the interval {xf""" : /J*^'"-' < x^""" < mJ*^^*^} 
where Ij^^*'' and u^^^^^ are sup and inf points such that a minimal number of observations in 
ri{-Kt) U {x(} have j^^ coordinate less than l^^^^^ and strictly greater than wj*^^*"* (i.e., such that 
each new leaf satisfies minimum data restriction for the leaf regression model, as described in 



Section 2.2). Combining these grow mechanics with our discussion on prior specification, the 
implied conditional prior for moving from 7t-i to a specific grown tree % is the product of 
Pgrow (i-e-5 usually 1/3), the penalty term 7r(7t), and the probability of the specific split location 
used in this grow move. Viewed another way, the marginal prior probability of growing in any 
manner is Pgrow7r(7t) integrated over Tt with respect to the measure for possible split locations. 

We place a conditional uniform prior distribution over the above set of candidate split lo- 
cations. Hence, j is uniformly distributed over covariate dimensions with non-empty sets of 
possible grow points and, given j, the split point is assigned a uniform prior on [/J*^''*'', mJ*^^*^]. 
This distribution is uniform over eligible covariate dimensions, regardless of variable scaling 
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within each dimension. As an aside, since the likelihood is unchanged for all grow moves 
which result in equivalent new leaf nodes - that is, the model is only identified up to location 
sets which separate elements of r]{xt) - an alternative prior would restrict split locations to 
members of r]('Xt) (such that every possible grow changes the likelihood). We prefer the former 
option since it leads to smoother posterior predictive surfaces in our particle approximations. 

The formulation in this section has many things in common with the original CGM par- 
titioning approach. Indeed, both involve similar grow and prune moves. However, in CGM 
these are merely MCMC proposals, whereas in dynamic trees they are embedded directly into 
the definition of the sequential process. Also, analogues of change and swap are not present in 
our sequential formulation. Although these could have been incorporated, we felt that limiting 
computational cost was more desirable. In particular, there are two aspects of our framework - 
a global particle approach to inference and a filtered posterior that changes only incrementally 
with each new observation - which allow for the convenience of a smaller set of tree rules. 



2.2 Prediction and Leaf Regression Models 

Our dynamic tree model is such that posterior inference is driven by two main quantities: the 
marginal likelihood for a given tree and the posterior predictive distribution for new data. This 
section establishes these functions for dynamic trees, and quickly details exact forms for our 
three simple leaf regression models. 

First, with each leaf r] G parametrized by ~ 7r(^^), the likelihood function is available 
after marginalizing over regression model parameters as 



This is combined with the conditional prior of Section 2. 1 to obtain posterior p(7t | [x, yY,Tt 
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Second, the predictive distribution for yt+i given x^+i, Tt, and data [x, yf, is defined 



V{yt+i I Xi+i, Tu [x, y]') = Y>{yt+i I [x, (4) 

= y"p(yi+i|x,+i,^^)dP(^^|[x,i/r('''-^^)) 

where node rjiyit+i) G is the leaf partition containing x^+i (obtained by passing through 
the split rules at each level of %) and dP is the posterior distribution over leaf parameters 
given the data in rjiyit+i)- That is, the predictive distribution for covariate vector x is just the 
regression function at the leaf containing x, integrated over the conditional posterior for model 
parameters. In a somewhat subtle point, we note that a modeling choice has been made in our 
statement of Q: prediction at x^+i depends only on rather than averaging over potential 
7f +1 as would be necessary for a conventional state-space model. Our predictive function in (|4]) 
leads to substantially more efficient inference (the alternative is impractical in high dimension), 
and it seems to us that the distinction should not produce dramatically different results. 

We concentrate on three basic options for leaf regression: the canonical constant and lin- 
ear mean leaf models, and multinomial leaves for categorical data. From ([3]) and (|4]), we see 
that conditioning on a given tree reduces our necessary posterior functionals to the product of 
independent leaf likelihoods and a single leaf regression function, respectively. Ease of imple- 
mentation and general efficiency of our models depends upon an ability to evaluate analytically 
the integrals in (|3]) and (|4]), such that prediction and likelihood are always marginalized over 
unknown regression model parameters (given a default scale-free prior specification). Fortu- 
nately, such quantities are easily available for each leaf, and the following three subsections 
quickly outline modeling specifics for our three types of regression tree. 

Before moving to leaf details, we note that the complexity of leaves is limited only by 
computational budget and data dimensions. For example, the treed Gaussian process (TGP) 



approach of Gramacy and Lee (2008) fits a GP regression model at each leaf node. The tree 
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imposes an independence structure on data covariance, providing an inexpensive nonstationary 
GP model. However, unlike the models proposed herein, GP leaves do not allow inference 
to be integrated over regression model parameters. This significantly complicates posterior 
inference, leading to either reversible-jump methods for MCMC or much higher-dimensional 
particles for sequential inference (which can kill algorithm efficiency). Thus, although dynamic 
trees can be paired with more complex models, the robust nature of less expensive trees will 
lead to them being the preferred choice in many industrial applications. Indeed, our results in 
Section |3] indicate that extra modeling does not necessarily lead to better performance. 

2.2.1 Constant Mean Leaves 

Consider a tree Tt partitioning data [x, y]* at time t, and let 1 77] < t be the number of observations 
in leaf node 77. The constant mean model assumes that y'^ = {yi : Xj G 77} are distributed as 

yl...,y^^^'^N{^^„al). (5) 

Under this model, leaf sufficient statistics are yr^ = ^j^^ Hi /\v\ ^i^d = Yl^J^liiUi " Vv^ = 
Z]l=i(l/I')^ ~ \v\y'^^ where our prior forces the minimum data condition \ri\ > 2. Note that these 
statistics are easy to update when the leaf sets change. 

Under the motivation of an automatic regression framework, we assume independent scale- 
invariant priors for each leaf model, such that 7r(yU^, cr^) oc l/cr^. The leaf likelihood is then 
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For covariate vector x allocated to leaf node r], the posterior predictive distribution is 



1 + ^ 

V{y I x,r/, [x,?/]'') = p{y I r],y'') = St\y; y^, ^, , _ ^ s^, jr^l - 1 j , (7) 



where St is a Student-t distribution with — 1 degrees of freedom. 
2.2.2 Linear Mean Leaves 

Consider extending the above model to a linear leaf regression function. Leaf responses y"^ are 
accompanied by design matrix X,, = [x^, . . . , xj'^|]', where {x^}[!!l]^ = x* fl r/, and 

?/''~N(/i^l|,l+X,/3^,aJl|,|) (8) 

for intercept parameter /x^, rf-dimensional slope parameter and error variance cr^. Sufficient 



statistics for leaf node rj are then and s^, as in Section 2.2.1 , plus the covariate mean vector 



x^ = X]l=i ^i'/l'?!' shifted design matrix X^, = [x^ — x^, . . . , xj'^i — x^]', Gram matrix = 
X'^X^ and slope vector = ^,y^(X^)'(?/'' — |/^) for the shifted design matrix, and regression 
sum of squares TZr, = We now restrict \ri\ > d + 2. As before, these statistics are easily 

updated; in particular, matrices can be adapted through partitioned inverse equations. 

We again assume the scale invariant prior 7r(/x^, a.^) oc and, as above, it is straight- 
forward to calculate the essential predictive probability and marginal likelihood functions. The 
marginal likelihood for leaf node r] is then 
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For X allocated to leaf node rj and with x = x — x^, the posterior predictive distribution is 



P(l/ I [x,y]'') = St^^y; y^ + x'/?^, (l + lr^l ^+x'^^^x) 



TZr 



\ri\ —d—1 



d-1 ]. (10) 



2.2.3 Multinomial Leaves 

For categorical data, as in ([T]b), each leaf response is equal to one of C different factors. 
The set of responses, y^, can be summarized through a count vector = [z^, . . . , z^]', such 
that = Yl^gii t{y] = c). Summary counts for each leaf node are then modeled as 



~ MN(p^, \r]\ 



(11) 



where MN(p, n) is a multinomial with expected count Pc/n for each category. 

We assume a default Dirichlet Dii(lc/C) prior for each leaf probability vector, such that 
posterior information about is summarized by = (z^ + l/C)/(|r7| + 1), which is trivial 
to update. The marginal likelihood for leaf node t] is then 



(12) 



Covariates x allocated to leaf node i] lead to predictive response probabilities 



viy = c\ X, r], [x, y]"^) = p{y = c \ z^) = p^?, for c = 1, . . . , C. 



(13) 



2.3 Particle Learning for Posterior Simulation 

Posterior inference about dynamic regression trees is obtained through sequential filtering of a 
representative sample. The posterior distribution over trees at time t — 1 is characterized by N 
equally weighted particles, each of which includes a tree T}1\ encoding recursive partitioning 
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rules and 5'f!_\, the associated set of all sufficient statistics for each leaf regression model. Upon 
the arrival of new data, particles are first resampled and then propagated to move the sample 
fromp([r,5]t_i I [x,y]*-i)top([r,5]t | [x,y]*). 

In particular, our approach is a version of the particle learning (PL) sequential Monte Carlo 
algorithm introduced by [Carvalho et al.| ( [2010a| ) in the context of mixtures of dynamic linear 



models. The methods in this section are specific to dynamic regression trees and we defer 



further detail to the original PL paper (in addition, Carvalho et al. , 2010b focuses on PL for 
general mixtures and thus describes a partition model with parallels to trees). In the context of 
our dynamic trees, PL recursive update equations are 

p{[T,S]t\[^,yY) = Jp{[T,S]t\[T,S]t^,,[^,y]t)dF{[T,S]t-i\[^,yY) (14) 

oc I p{[r,s]t I [r,s]t-i,[^,y]t)p{[^,y]t I [r,s]t-i)dp{[r,s]t-i \ [x,y]*-^), 

where the second line is due to a simple application of Bayes rule. Hence, the particle set 
{[T, S]fl-^}f^i ~ p([T, S]t-i I [x, y]*"^) can be updated by first resampling particles propor- 
tional to p([x, y]t I [T, S]t^i), and then propagating from p([T, S]t \ [T, S]t-i, [x, y]t). 



Although wider questions of PL efficiency are addressed in [Carvalho et al.| ( |2010a[ ), we 
should make some quick notes on the tree-specific algorithm. Crucially, resampling first to 
condition on [x, yf introduces information from [x, y]t that is not conditional on 71, thus reduc- 
ing direct dependence of particles on the high dimensional tree-history T* (this same idea is 
found in the best particle filtering algorithms, such as [Kong et al.| ( |1994] ) and [Pitt and Sheph^d| 



( 1999 ), even if it remains standard to track and weight all of T*). Second, any analytical integra- 
tion over model parameters will greatly improve Monte Carlo sampling performance through 
more efficient Rao-Blackwellized inference. In our case, simple leaf models allow direct eval- 
uation of p( 71 1 [x, y]*) = /p(7;,{^^ : V e LrJ|[x,y]*)dP({0^ : 7] e LrJIfx,?/]*); without 
such marginalization, particles would need to be augmented with all leaf parameters. Finally, 
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since only the neighborhood of ri{-xt) changes under any of these moves, it is possible to ignore 
all other data and tree structure when calculating relative probabihties for tree changes. This, 
combined with grow moves that are only identified up to a discrete set of locations, restricts 
particle updates to a low dimensional and easily sampled space. 

In full detail, suppose that {7^^i}^i ~ p(7t_i | [x, y]*^^) is a particle approximation to the 
posterior at time t — 1. The dimension of each [T, S]f^ depends upon both the tree structure and 
the regression model, with each particle containing | sets S^l^ of the leaf sufficient statistics 



described in Section 2.2 All trees/particles begin empty. Due to minimum data restrictions for 
each partition, stay is the only plausible tree move until enough data has accumulated to split 
into two partitions. Hence, updating begins at t = 6 for the constant model and at t = 2{d + 2) 
under linear leaves, with each Tt ^^ consisting of a single leaf/root node. 
Upon observing new covariates xt and response yt, update as follows. 

Resample: Draw particle indices {({i)}fL^ with predictive probability weight 

p(C(i) = i) (X p{yt I Xt, [T, Sft'l^) = p{yt \ x^, 



as in (|7]), ([To]), or (13). There are various low-variance options for resampling (e.g.. 



Cappe et al. 2005 ), and our implementation makes use of residual-resampling. 



Set I = % I for each i to form a new particle set. 

Propagate: We need to update each tree particle with a sample from the discrete distribu- 
tion p([T, I [T,S]t-i, [x,y]t) oc p{Tt I 7^-i,x*)p(?/* I 71, X*). First, propose changes 
for T}1\ —7- T}^'' via stay, prune, or a randomly sampled grow move on r7t^l\(xt). For 



each particle's grow, we propose from the uniform grow-prior of Section 2.1 by drawing 



j from eligible dimensions (i.e. those with non-empty sets of split locations) and then 
sampling the split point from a uniform on [l^^^^\u^^^*'^]. 

The three candidate trees are now % G {T^'^^y, 7-prune^ 7-grow|_ 
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Since candidate trees are equivalent above P{ri{-xt)), the parent node for xj on tree Tt-i, 
we calculate posterior probabilities only for the subtrees rooted at this node. With candi- 
date subtrees denoted 7^'^°™ and containing data [x, (including x^ and yt), the new % 
is sampled with probabilities proportional to 7r(7^™°^'^)p(?/* | x*, 7^"!°™). Here, the prior 
penalty is ([2]) and the likelihood is ([3]) with leaf marginals from ([6]), (|9]), or ( 12). 

Finish with deterministic sufficient statistic updates — )■ 5*1*^ 

These two simple steps yield an updated particle approximation {7^*^*^}^^ ~ p(7t | [x, y]*). 

In an appealing division of labor, resampling incorporates global changes to the tree poste- 
rior, while propagation provides local modifications. As with all particle simulation methods, 
some Monte Carlo error will accumulate and, in practice, one must be careful to assess its ef- 
fect. However, as mentioned above, our strategy makes major gains by integrating over model 
parameters to obtain particles which consist of only split rules and sufficient statistics. Given 
this efficient low-dimensional particle definition, our resampling-first procedure will sequen- 
tially discard all models except those which are predicting well, and this tempers posterior 
search to a small set of plausible trees with good predictive properties. We will see in Section 
|3]that PL for trees has some significant advantages over the traditional MCMC approach. 

2.4 Marginal Likelihood Estimation 

An attractive feature of our dynamic trees is that, due to the use of simple leaf regression 
models, reliable posterior marginal likelihood estimates are available through the sequential 
factorization p(y^ | x^) = nt=iP(^/t I ^t,[^,yY^^)- However, use of improper priors in 
constant and linear trees implies that the marginal likelihood is not properly defined. As de- 
scribed in I Atkinson] ( | 1 978| ) , this can be overcome if some of the data is set aside, before any 



model comparison, and used to form proper priors for leaf model parameters. Such "train- 
ing samples" are already enforced within our dynamic tree evolution through the minimum 
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partition- size requirements for each model. As such, a valid approximate marginal likelihood 
is available by conditioning on the first observations, where is large enough to provide 
a proper posterior predictive distribution for the unpartitioned "root" tree. Hence, we write 
p{y^ I x-^) ^ Y[t=to+i P(^* I i-^' vY^^) where the factors are approximated via PL as 

1 ^ 

E[p(?/, I x„[x,y]*-\7;_i)p(7;_i I [x,!/]*-^)] ^ ^5^P(1/* I x„[r,^]2i). (15) 

i=l 



This is just the normalizing constant for PL resampling probabilities (refer to Section [23 ) 



These marginal likelihood calculations make it possible to compare leaf model specifica- 
tions through the marginal likelihood ratio or Bayes factor (BF). For example, in a comparison 
of linear leaves against a constant leaf model, pim{y^ \ x^)/pconst(y^ | x-'") measures the 
relative evidence in favor of linear leaves (see Kass and Raftery[ |1995[ for information on as- 



sessing BFs; also, note that the posterior probability of linear leaves is pim{y^ \ x"^) / {pimiv^ \ 
+ Pconst(2/'^ I x^)) given even prior probability for each model). Linear and constant 
leaves are just extreme options for the general model choice of which covariates should be leaf 
regressors, and the BF approach is generally applicable in such problems. 

This approach to inference is data-order dependent, due to the state-space factorization 
assumption p{yt \ x^,y^~^) = p{yt \ x*,?/*~^). In addition, BFs are only available conditional 
on the initial y*" training sample, and clearly for marginal likelihood estimation both training 



sample and data-order must be the same across models. However, we will see in Section 3.1 
that these BFs lead to consistent model choices in the face of strong evidence, and that average 
BFs over repeated reorderings and training samples serve an effective model selection criterion. 

3 Application and Illustration 

We now present a series of examples designed to illustrate our approach to on-line regression. 



Section 3.1 considers two simple 1-d regression problems, illustrating both constant and linear 
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mean leaf models over multiple PL runs, before comparing dynamic tree predictive perfor- 
mance against competing estimators in a 5-d application. The next two sections focus on the 



sequential design of experiments, with optimization problems in Section 3.2 and active learning 



in Section |3]3 Finally, Section 3.4 describes the application of dynamic multinomial leaf trees 



to a 15-d classification problem. Throughout, our dynamic trees were fit using the dynaTree 
package ( [Gramacy and Taddy] |2010b[ ) for R under the default parametrization. In particular. 



the tree prior of Section 1.2 is specified with ot = 0.95 and (3 = 2; inference is generally robust 
to reasonable changes to this parametrization (e.g., the four settings described in Figure 3 of 
CGM (2002) lead to no qualitative differences). 

3. 1 Regression Model Comparison 

We begin by considering two simple 1-d applications. The first "parabola data" set has 100 
observations from y{x) ~ N(x + x^, 1/5^), where x was drawn uniformly from [—3, 3]. The 
second "motorcycle data" set, available in the MAS S library for R, consists of observed accel- 
eration on motorcycle riders' helmets at 133 different time points after simulated impact. 
The top two rows of Figure [2] show 30 repeated filtered posterior fits for random re-orderings 



of the data, obtained through the PL algorithm of Section 2.3 with 1000 particles. The first row 
corresponds to constant leaf models while the second row corresponds to linear leaves, and 
each plot shows posterior mean and 90% predictive intervals. Although there is clear variation 
from one PL run to the next, this is not considered excessive given the random data orderings 
and small number of particles. Linear leaf models appear to be far better than constant models 
at adapting to the parabola data. With the parabola data, a constant leaf model leads to higher 
average tree height (9.4, vs 5.4 for linear leaves) while the opposite is true for the motorcycle 
data (average constant leaf height is 10.5, vs 12.8 for linear leaves). 

A more rigorous assessment of the relative evidence in favor of each leaf model is possible 



through estimated Bayes factors, as described in Section [24} Figure [3]presents filtered log BFs 
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comparing linear to constant leaves, calculated for the 30 random data orderings used in Figure 
|2]and conditional on the first to = 5 observations. As discussed in |2.4[ due to both random 



ordering and different y*" training samples, the BF estimates are not directly comparable across 
runs. However, for the parabola data, in every case the linear model is clearly preferable. 
For the motorcycle data, although there is no consistent evidence in favor of either model, the 
majority of runs produce log BF values below zero and the mean across runs (which eliminates 
order dependence as the number of runs increases) shows fairly strong evidence against the 
linear model. This agrees with the visually similar regression fits drawn in Figure [2} which 
were obtained after fewer average partitions under constant leaves than with the linear leaves. 

Finally, the bottom row of Figure |2] compares the means of the 30 dynamic tree fits to 
two similar modern nonparametric regression techniques: treed Gaussian processes (TGP) and 
Bayesian additive regression trees (BART; [Chipman et al. , |2010 ). Each new model is designed 



as an extension to constant or linear regression trees, and makes steps to partially alleviate the 
mixing problems of CGM's original MCMC inference. As mentioned previously, TGP takes 
advantage of a more flexible leaf regression model to allow for broader covariate partitions. 
In a different approach, BART proposes a mixture of relatively short trees, and the authors 
show that combinations of simple individual partitioning schemes can lead to a complicated 
predictive response surface. The TGP model is given 10 restarts of a 10,000 iteration MCMC 
run, while BART results are based on a single 1,100 iteration chain. For the parabola data, all of 
the models find practically identical fits, except for the poorly performing constant leaf dynamic 
tree. However, for the motorcycle data, we see that each model leads to mean functions that are 
very similar, but that posterior predictive 90% intervals for BART and TGP appear to variously 
over or under estimate data uncertainty around the regression mean. In particular, BART's 
global variance term is misspecified for this heteroskedastic data. 

To benchmark our methods on a more realistic high-dimensional example, we also consider 
out-of-sample prediction for a Friedman test function (originally designed to illustrate multi- 
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Figure 2: Top two rows show, for constant and linear leaves, posterior mean and 90% interval 
for each of 30 runs with 1000 particles to randomly ordered data. Bottom row shows mean and 
90% interval for the average across dynamic tree runs, as well as for BART and TGP. 
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Figure 3: Filtered log Bayes factors for linear against constant leaf tree models, calculated 
30 random data orderings (the mean is shown in black) with 1000 particles. 
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variate adaptive regression splines (MARS) in Friedman, 1991 1. The response is 10 sin(7rxiX2) + 
20(^3 — 0.5)^ + 10x4 + with N(0, 1) additive error. Our comparators, and their R im- 
plementations, are: treed constant (TC) and linear (TL) models, and GP models from tgp 
HGramacy and Taddyj [2009] ); MARS from mda ( [Leisch et alj [20091 ); Random Forests (RF) 



from randomForest ( |Liaw and Wiener| |2002[ ); BART from BayesTree ( [Chipman and 



McCuUoch, 2009); and neural networks with 2-3 hidden layers from the nnet library. 

For this experiment, 100 random training and prediction sets, of respective size 200 and 



1000, were drawn with inputs uniform on [0, 1] . Following Chipman et al. (2002), we measure 



performance through predictive root mean-square error (RMSE) to the true mean, and results 
are shown in Figure |4j Dynamic tree models (DTC/DTL) were fit in a single (random) pass 
with N = 1000 particles, and the number of MCMC iterations and restarts, etc., for other 
Bayesian methods was such that Monte Carlo error in RMSE for repeated runs on the same data 
is negligible against the results in Figure |4} Non-Bayesian comparators were fit under package 
defaults. The dynamic versions of the tree models clearly outperform their static counterparts, 
and the dynamic treed linear model (DTL) performs nearly as well as or better than the GP and 
BART, both of which offer flexible mean functions under a (true) constant variance term. 

3.2 Search Optimization 



In Taddy et al. (2009), TGP regression was used to augment a local pattern search scheme 



with ranked lists of locations that maximize a measure of expected improvement. The GP 
leaf model was successful in this setting - deterministic optimization with 8 inputs - but pair- 
wise covariance calculations and repeated MCMC runs became unwieldy in higher dimen- 
sions. At the same time, such hybrid optimization schemes are most useful in complicated 
high-dimensional settings: although true global optimization is impossible without great ex- 
pense, a hybrid method uses regression to locate promising input areas and move the local 
search appropriately, thus iterating towards robust optimality. 
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Figure 4: Predictive RMSE on the Friedman data; sample of 100 for each estimator. 



We propose that our dynamic regression trees, due to a flexible on-hne inference framework, 
provide a very attractive regression tool for hybrid local-global search optimization. Although 
they are not interpolators, and thus of limited usefulness in deterministic optimization, both 
constant and linear regression trees provide an efficient model for the prediction and optimiza- 
tion of stochastic functions. Such stochastic optimization problems are common in operations 
research and, despite being deterministic, many engineering codes are more properly treated 
as stochastic due to numerical instability. This section will outline use of dynamic regression 
trees for the global component of a stochastic optimization search. 

We seek to explore the input space in areas that are likely to provide the optimum response 
(by default, a minimum). In the deterministic setting, Jones et al. ( |1998| ) search for inputs 
which maximize the posterior expectation of an improvement statistic, max{/mm — /(x)) 0}- 
In more generality, improvement is just the utility derived from a new observation. For global 
search optimization, there is no negative consequence of an evaluation that does not yield a 
new optimum (this information is useful), and the utility of a point which does provide a new 
optimum is, say, linear in the difference between mean responses (other quantiles or moments 
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are also possible). Thus, the improvement for a stochastic objective y(x) after t observations 
is /(x) = maxjyinin — y(x),0} , where y(x) is the regression model mean function (e.g., 
^(x) = firi for constant leaves or y(x) = /x^ + x'/3^ for linear leaves) and ^min is this expected 
response minimized over the input domain. 

In hybrid schemes, we have found that maximizing E[/(x)] can be overly myopic. We thus 
incorporate an active learning idea (refer to Section [33] ), and instead search for the maximizing 
argument to G{x; 0) = E[/(x)] + sd(y(x))/0, where is a precision parameter that can be 
decreased to favor a more global scope. Conveniently, use of constant or hnear leaves allows 
for both E[/(x)] and sd(y(x)) to be calculated in closed form conditional on a given tree. In 
particular, with St(a^(x), &,y(x), Crj) the posterior for y(x) given tree % and ymin fixed at the 
minimum for a^(-) over our input domain, expected improvement E[/] is (suppressing x) 



mm '^jy)Tc^ 



Z/min (^rj 



+ 



Crj-l 



Cri + 



(y 



mm (^Ti) 



1/min (^ri 



with Tc and tc the standard t cumulative distribution and density, respectively, with c degrees of 
freedom (this is similar to the marginal improvement derived in Williams et al.[[2000[ ). Finally, 



since E[J] = ^ ^iL^ E[/ | T^*)] and Var(y) = E[Var(j/ | T)] + Var(E[y | T]), it is possible 
to obtain G = E[/] + sd(y)/0 by evaluating the appropriate functionals conditional on each 
T G {T}^^}fLi before taking moments across particles (see Section 3.3 for further detail). 



In a generic approach to sequential design, which is also adopted in the active learning 



algorithms of Section 3.3, the choice of the "next point" for evaluation is based on criteria 



optimization over a discrete set of candidate locations. After initializing with a small number 
of function evaluations, each optimization step augments the existing sample [x, by drawing 
a space-filling design (we use uniform Latin hypercube samples) of candidate locations, X = 
{xi}fLi, and finding x* G X which maximizes G(x*). Next, evaluate the new location, set 



[x(+i,?/t+i] = [x*,?/(x*)], and follow the steps in Section 2.3 to obtain an updated particle 
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approximation to the tree posterior. This is repeated as necessary. 

Figure [5] shows some results for mean optimization of the test function y = sin(x) — 
Cauchy(x; 1.6, 0.15) + e, with e ~ N(0,o" = 0.1), for precision values 0=1 and (p = 
10. The fitted regression trees assume linear leaves and, in each case, we used 500 particles 
and 100 candidate locations. The role of (p in governing global scope is clear, as the higher 
precision search is quickly able to locate a robust minimum but does not then proceed with 
wider exploration. The linear leaf model seems to be efficient at fitting our test function. 

We also consider mean optimization of the two dimensional exponential function y ~ 



N(a;i exp(— a;^ — x^), 10^ ), which is discussed further in Section 3.3 Given data from a ran- 
dom initial sample of 10 locations, our search routine was used to obtain the next 10 function 
evaluations, with = 1, 10, or 100, using 1000 particles and candidate sets of 200 locations. A 



TGP-based search optimization, as described in Gramacy and Taddy (2010a), was also applied 



This routine is the same as that outlined above for dynamic trees, except that it maximizes E[/^] 
rather than G, where each iteration requires a new MCMC run (we use 500 iterations after a 
burn-in of 500), and the candidate set is augmented with the location minimizing predicted 
response (such adaptation will often lead to a more efficient search). In this, the g parameter 



increases the global scope (refer to Schonlau et al. , 1998) and we consider g = 1, 5, or 10 



Each optimization scheme was repeated 50 times, and the summary is shown in Table [T] 



In a result that will parallel findings in Section 3.3, constant, rather than linear, leaves 
lead to a better exploration of the 2-d exponential function; this relative weakness exposes a 
difficulty for linear models in fitting the rapid oscillations around (0, 0). Interestingly, constant 
trees also lead to better results than for the TGP optimization routine. This is despite the added 
flexibility of GP leaves, an augmented candidate sample, and use of a wide range of g values 
for E[/^] (which, although harder to compute when marginalizing over model parameters, is 
the literature standard for this type of search). In addition, due to repeated MCMC runs, the 
TGP algorithm requires near to 10 times the computation of sequential tree optimization. 
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Figure 5: Optimization with dynamic regression trees. For each iteration, posterior mean and 
90% interval are on top (with true function and data in grey, and "next point" in green), while 
the bottom plot has E[J] in green, sd(y) in blue, and G = E[/] + sd(y)/0 in red. 
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Table 1: Minimization of the 2-d exponential function. For each optimization scheme, the top 
number is mean solution and the bottom number is standard deviation (out of 50 runs). 
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3.3 Active Learning 

Another common application of sequential design is focused on evaluating an unknown re- 
sponse surface - the mean function /(x) in (l.a) - at its most informative input location. Such 
active learning procedures are intended to provide efficient automatic exploration of the covari- 
ate space, thus guiding an on-line minimization of prediction error. As in the search optimiza- 
each active learning iteration will draw candidate locations, X = {xi}f£i. 



tion of Section 



3.2 



and select the next (t + 1^*) design point to be x* G X which maximizes a heuristic statistic. 
However, instead of aiming for a function optimum, we are now solely interested in under- 
standing the response surface over some specified input range. Hence, we need to choose x* to 
maximize some measure of the predictive information gained by sampling y{x*). 



Two common heuristics for this purpose are active learning MacKay (ALM; 1992) and 



active learning Cohn| (ALC; |1996 ). An ALM scheme selects the x* that leads to maximum 



variance for ?/(x*), whereas ALC chooses x* to maximize the expected reduction in predictive 
variance averaged over the input space. A comparison between approaches depends upon the 
regression model and the application, however it may be shown that both approximate a max- 
imum expected information design and that ALC improves upon ALM under heteroskedastic 
noise. Both heuristics have computational demands that grow with |X|: ALC requires time 



in 0(|Xp), wheres ALM is in 0(|X|). Gramacy and Lee (2009) provide further discussion 



of active learning as well as extensive results for ALC and ALM schemes based on MCMC 
inference with CGM's static treed constant/linear (TC/L) models, GPs, and TGPs. 

Since active learning is an inherently on-line algorithm, it provides a natural application for 
dynamic regression trees. The remainder of this section will illustrate ALC/ ALM schemes built 
around particle learning for trees, and show that our methods compare favorably to existing 



MCMC -based alternatives. As in |3.2[ both constant and linear leaf models lead to closed- form 
calculations of heuristic functionals conditional on a given tree. Fast prediction allows us to 
evaluate the necessary statistics across candidates, for each particle, and hence find the optimal 
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X* under our heuristic. Trees are then updated for y(x*), and the process is repeated. 

ALM seeks to maximize Var(?/(x)), which is fairly simple to calculate. For a given tree 
Tt such that x is allocated to leaf node rj G Lt^, let /i,,(x) = E[|/(x) | rj] and v^(x) = 
Var(?/(x) I T]) denote the conditional predictive mean and variance respectively for y(x). These 
are available from equations ^ and (10) for each leaf model as /U,y(x) = or yU^(x) = 
^,+x%andv,(x) = s2(l+l/|r/|)/(|r/|-3) orv,(x) = [4-7^,] (l + I / \r^\ + ^ g-^ii) /{\r^\- 
d — 3). Given the particle set {Tt^^}f^i, the unconditional predictive variance is 



Var(?/(x)) = E[Var(y(x) | T)] + Var(E[?/(x) | T]) 



1 



■ N 



vr'(x) + /ij*^(x) 



1 ^ 
1=1 



(16) 



which is straightforward to evaluate for all x G X during our search for the maximizing x*. 

The ALC statistic is more complicated. Let A(T^(x') = Var(y(x')) — Var(y(x') | x') 
denote the reduction in variance at x' when the design is augmented with x. Since ?/(x) is 
not observed, the ALC statistics must condition on the existing model state. Hence, we define 
A(Tx(x') = E[A(Tx(x' I Tt)] at time t in terms of the conditional variance reduction 



Aa2(x' I Tt) = Var(y(x') | %) - Var(y(x') | x, 71). 



(17) 



Expressions for A(Tx(x' | Tt) may be obtained as a special case of the results given by Gramacy 
and Lee (2009). Clearly, Acr^(x' | ?<) = if x and x' are allocated to different leaves of %■ 



Otherwise, for x and x' allocated to G Lji in a given tree, we have that A(t^(x' | Tt) 
A(Tx(x' I -rf) is (for constant and linear leaves respectively) 



\r]\ — 3 



X 



I'?! 



or 



X 



m 



l + R + (^')'^.-^x' 



(18) 



where x = x — x^ and similarly for x'. In practice, the integral Acr^(x) = A(j^(x') dx! is 
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Figure 6: Active learning comparison for the parabola (left) and motorcycle data (right). The 
top row shows Var(?/(x)) for ALM and the bottom shows a/A(j^(x) for ALC. 

approximated by a discrete sum over the random space-filling set X. Hence, x* is the candidate 
location which maximizes the sum of Acr^* (x' | Tt) over both x' G X and Tt G {Tt^^f^x- 



We use the parabola and motorcycle data examples of Section 3.1 to compare ALC and 
ALM for both constant and linear leaf dynamic trees. Figure [6] illustrates the statistics associ- 
ated with each combination of the two heuristics and two regression models, evaluated given 
the complete datasets. For the parabola data (left column), the ALM and ALC plots look 
roughly similar and, in each case, the linear model statistics are much more flat than for the 
constant model. In contrast, the ALM and ALC plots are very different from each other for the 
motorcycle data, which exhibit heteroskedastic additive error. 

For a further comparison we return to the sin/Cauchy data from Section 3.2 Figure |7] 
shows sequential design progress under our dynamic tree model with linear leaves and the 
ALC heuristic in three snapshots: after an initial t = 10 Latin hypercube (LHS) sample, after 
15 samples taken via ALC (t = 25) and after 45 ALC samples (t = 55). With the exception of 
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t = 10 t = 25 t = 55 
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Figure 7: Snapshots showing the progression of the ALC heuristic driven by the dynamic tree 
model with linear leaves on the sin/Cauchy test data. The top panels show the true surface 
(grey lines), sampled input/output pairs (grey points), and the dynamic tree predictive means 
and 90% quantiles (red lines). The bottom panels show the corresponding Acr^(x) surface. 
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0.2458 


GP 


ALC 


0.05419 



Table 2: Comparing models and active learning heuristics on the 1-d sin/Cauchy data and the 
2-d exponential data. The tables are sorted on the fourth column (RMSE). 
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t = 10, when there is not enough data to support a split in the tree(s), the ALC statistic is high 
at points where y{x) is changing direction — ^highest where it is changing most rapidly. 

Table |2] (left) shows how the dynamic tree models compare to various static Bayesian treed 
models, including TGP, on this data. All experiments were started with an initial LHS in 
[0, 7] of size 10 followed by 40 active learning rounds. Each round uses 20 random LHS 
candidates X in [0, 7]. At the end of the 40 rounds the root mean squared error (RMSE) was 
calculated for a comparison of predictive means to the truth in a size 200 hold-out set from a 
GP-based maximum entropy (ME) design in [0, 7]. This was repeated 30 times and the average 
RMSE is shown in the table. The expensive TGP methods dominate, with dynamic trees using 
linear leaves (DTL) trailing. In both cases, ALC edges out ALM. Interestingly, while static 
treed linear models (TL) fit through MCMC are amongst the worse performers, their sequential 
alternatives are amongst the best. Neither TC or DTC are strong performers, perhaps due to the 
almost linear derivatives of our test function. Since the "interesting" aspects of this response 
are evenly distributed throughout the input space, both the GP with ALM and the offline LHS 
and ME comparators also do well. 

Table [2] (right) shows the results of a similar experiment for the 2-d exponential data from 
Section [X2| The setup is as described above, except that the initial LHS is now of size 20 and is 
followed by 55 active learning rounds. Here, the dynamic trees share the top 6 spots with TGP 
only, and clearly dominate their static analogues. Indeed, DTC is the second best performer, 
echoing our findings in Section|3]2 that the constant leaf model is a good fit for this function. In 
contrast to the results from our simple 1-d example, the offline (LHS and ME) and GP methods 
are poor here because the region of interest is confined to the bottom left quadrant of our 2-d 
input space. Finally, in all of these examples, dynamic regression trees (DTC and DTL) are the 
only methods that run on-line and do not require batch MCMC processing. 
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3.4 Classification 

Classification is one of the original applications for tree models, and is very often associated 
with sequential inference settings. Categorical data can be fit with multinomial leaf trees (as 
in |2.2.3 ) and, without an application specific loss function, the predictive classification rule 



assigns to new inputs the class with highest mean conditional posterior probability. That is, 

arg max 1 1 5^ (p^W) c = 1, . . . , C 1 , (19) 



c X 



1=1 



for a set of N particles, where each pc^^^ is the c-class probability from ( 13 1 for the leaf corre- 



sponding to X. Evaluating c(x) over the input space provides a predictive classification surface. 
We first apply our methodology to the Cushing's data, again available in the MASS library 



for R, used throughout the book by Ripley ( 1996 1 to illustrate various classification algorithms. 



This simple data set has two inputs - each patient's urinary excretion rates for tetrahydrocorti- 
sone and pregnanetriol, considered on the log scale - and the response is one of three different 
types of Cushing's syndrome (referred to as a, b, or c). We fit the multinomial leaf dynamic tree 
model to this data, and compare our results to those from a GP soft-max classifier as outlined 
in 



Broderick and Gramacy (2009), which fits independent GP models to each latent response 
yc such that pc = exp(yc)/ Ylf=i ewivi)- 

Figure [8] illustrates the outcome of this example. The dynamic tree results (top row) are 
based on a set of 1000 particles, while the GP soft-max results (bottom row) use every lOO*'^ ob- 
servation of a 10,000 iteration MCMC chain. The left column of this figure shows the syndrome 



classification surface based on maximum mean posterior probability, as in ( 19 1. Although it is 
not possible to assess whether any classifier is outperforming the other, these surfaces clearly 
illustrate the effect of axis-symmetric (tree) classification as opposed to the surface from a 
radial-basis (GP) model. We also note that the GP soft-max classification rule is very similar 
to results from Ripley ( 1996[ chap. 5.2) for a neural network fit with weight decay. 
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Classification 
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Figure 8: Cushing's data example. The left column shows the class corresponding to maximum 
mean posterior probability, and the right has entropy for the posterior mean probability vector. 
Results for a dynamic classification tree are on top, and for a GP soft-max classifier on the 
bottom. Class (a— t-c) and entropy rise from black to white, and plotted data is labelled by class. 

The right column of Figure js] shows the entropy, — J2ie{abc}Pi^^s{Pi)^ of the posterior 
mean probability surface for each classification model. Such entropy plots, which are com- 
monly used in classification settings to assess response variability, illustrate change in expected 
value of information over the input space. Indeed, Gramacy and Poison ( 2010[ ) propose entropy 



as an active learning criteria for classification problems, analogous to ALC in Section 3.3 In 
our application, dynamic tree entropy is peaked around a 3-class border region in the top-left. 
This intuitively offers better guidance about the value of additional information than the GP 
entropy surface, which is high everywhere except right at clusters of same-class observations. 

To better assess the performance of our dynamic tree classifier, we turn to a larger "credit 
approval" data set from the UC Irvine Machine Learning Database. This data includes 690 
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credit card applicants grouped by approval status (either '+' or '-') and 15 input variables. 
Eleven of these inputs are categorical, and for each of these we have encoded the categories 
through a series of binary variables. This convenient reparametrization, which expands the 
input space to 47 dimensions, allows direct application of the model in Section [TT| with nodes 
splitting on = or 1 for each binary variable (see |Gramacy and Taddy[ 2010a| , for discussion 
of this approach in general partition trees). As an aside, we note that the binary encoding 
can also be used to incorporate categorical data into constant and linear leaf trees; the only 
adaptation is, for linear leaves, to exclude binary variables from each regression design matrix. 

We applied the multinomial dynamic tree to 100 independent repetitions of training on 90% 
of the credit approval data and predictive classification on the left-out 10% sample. This exper- 



iment is identical to that in Broderick and Gramacy (2009), who tested both a TOP soft- max 
classifier (includes binary variables for partitioning of latent TOP processes and fits indepen- 
dent GP to continuous inputs within each partition) and a naive GP soft-max classifier (fit to the 
reparametrized 47 inputs, treating binary variables as continuous in the correlation function). 
Their results are repeated here, beside those for our dynamic tree, in Table [3| Once again, 
despite using a less sophisticated leaf model, the dynamic tree is a clear performance winner 
and the only classifier able to beat 14% average missclassification. The dynamic tree was also 
orders-of-magnitude faster than the GP and TGP classifiers (we use 1000 particles, and they 
have 15,000 iterations), and only our approach will be feasible in on-line applications. 



Classifier 


Dynamic Multinomial Tree 


TGP soft-max 


GP soft-max 


Missclassification rate 
Time (CPU hours) per fold 


0.136 (0.038) 
0.01 


0.142 (0.036) 
1.62 


0.146 (0.04) 
5.52 



Table 3: Mean (and standard deviation) for out-of-sample missclassification over 10 random 
10-fold cross-validations and computing time for each classifier on the credit approval data. 
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4 Discussion 



We have reformulated regression trees as a dynamic model. This sequential characterization 
leads to an entirely new class of models which can be fit on-Une; have a scale-free automatic 
prior specification; avoid the need for parameter sampling; and lead to robust and flexible 
prediction. We have shown empirically that our dynamic regression trees can provide superior 
performance and are less expensive than common alternatives. 

In a key point, we have been able to define particles for filtering which contain only split 
rules and sufficient statistics, leading our sequential filtering to efficiently discard all partition 
models except those which are predicting well. Due to the size and complexity of potential 
tree space, this narrowing of posterior search is an essential aspect of our models' success. In 
addition, restrictions on partition size allow us to make use of improper priors (for constant and 
linear leaves), which is not usually possible in particle inference. 

The modeling and examples contained herein provide encouragement for further work un- 
der a strategy that looks to take advantage of sequential model characterizations. In particular, 
we hope to find that similar ideas on update mechanisms for covariate dependent model features 
will lead to insight about dynamic versions of other graphical structures. 
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